Determining a level of expertise of a text using classification and application to information retrieval

ABSTRACT

An apparatus and method are provided for determining a level of expertise applicable to a particular document and for using this determined level of expertise in an improved information retrieval arrangement. A trainable document classifier is used to identify an applicable level of expertise using a metric indicative of the commonality, as measured with reference to a reference corpus, of terms comprised in a given document, trained using a training set of documents comprising, for each of a plurality of predetermined levels of expertise, a representative sample of documents and their respective metric values. An information retrieval apparatus is arranged to identify documents relevant to a specified category of information and to select from documents so identified those having a level of expertise, determined by the trained document classifier, matching a specified level of expertise for a target user in respect of that category of information.

This invention relates to information retrieval and in particular to a method and apparatus for identifying and retrieving information taking account of a level of expertise likely to be required of a user accessing it, and to a particular method and apparatus for determining the level of expertise applicable to a given set of information.

It is known to classify documents according to a number of different criteria, in particular according to information topic. Numerous prior art techniques have been devised to achieve automatic or semi-automatic classification of documents. Known classification techniques have been applied in particular to information retrieval arrangements to group or to help locate documents relating to particular topics of interest. However, while a search for relevant documents may be successful in locating a number of documents relevant to a particular topic of interest, the intended audience for each document will vary and many located documents may prove unsuitable for particular users, being for example too general for a specialised user having significant expertise in the topic.

According to a first aspect of the present invention there is provided a method for determining a measure of the level of expertise applicable to an information data set, comprising the steps of:

(i) selecting, in respect of each of a plurality of predetermined levels of expertise, a representative sample set of information data sets;

(ii) determining, for each of said selected information data sets, the value of a metric indicative of the incidence, in a reference corpus of information, of terms comprised in the selected data set; and

(iii) using the values of said metric determined in step (ii) to train an information classifier to identify at least one of said plurality of predetermined levels of expertise applicable to an information data set using a value of said metric determined for the information data set.

The metric chosen for use in preferred embodiments of the present invention has the property that the values of the metric, calculated for different representative samples of data sets in a training set selected in step (i) above, fall within substantially distinct ranges. This enables a document classifier to be trained to rate a given information data set according to which of the predetermined levels of expertise is most applicable, based solely upon the value of the metric calculated for the information data set being rated.

A value for the metric is calculated with reference to a reference corpus of information in a relevant language. In preferred embodiments of the present invention, the reference corpus used is the British National Corpus, referenced below, although an equivalent corpus may be available in respect of languages other than English. The reference corpus provides a measure, for each term, of the incidence of that term in the language represented by the corpus. For the purposes of the present patent application, “term” is intended to relate to a word or phrase or part of a word, e.g. a stemmed word. Different, more specialised corpora of information may be selected, for example a corpus representative of the use of terms in speech, a corpus representative of written use, or a corpus of children's literature in a particular language.

Preferably the metric comprises a combined measure of the incidence within an information data set of terms comprised in the information data set and of the incidence of each said term in the reference corpus. In this way, the observed incidence of a particular term in the reference corpus may be weighted more highly, and hence contribute more to the value of the metric, the more frequently that term is found to occur in the information data set being rated. A preferred formula for calculating values for the metric is given in the detailed description below.

Preferably, training the classifier comprises:

(a) making distributions of normalised values of said metric for data sets in each of the representative sample sets selected at step (i), above; and

(b) for each of said predetermined levels of expertise, identifying from said distributions a corresponding range of normalised values of said metric.

Normalised values of the metric are obtained, in a preferred embodiment of the present invention, by taking account of the length of the information data set being rated in comparison with the mean length of data sets used to construct the reference corpus.

In a preferred embodiment of the present invention, the trained classifier is arranged to determine a measure of the probability that a particular one of said predetermined levels of expertise is applicable to the information data set being rated. For example, if it is found that distributions of the calculated values of the metric for the training samples of data sets are overlapping to some degree, then there may be more than one level of expertise yielding a non-zero probability of association with the information data set being rated. An output expressed in the form of probabilities for each predetermined level of expertise may be particularly useful in fuzzy processing arrangements.

Preferably, determining a value for said metric comprises applying a stemming algorithm to stem terms comprised in a respective information data set and determining the incidence of the stemmed terms in the reference corpus. In particular, an algorithm such as that described in Porter, M. F., 1980, “An algorithm for suffix stripping”, Program, 14(3):130-137, since reprinted in Sparck Jones, Karen, and Peter Willett, 1997, Readings in Information Retrieval, San Francisco: Morgan Kaufmann, ISBN 1-55860-454-4, may be used to stem terms prior to obtaining their measure of incidence in the reference corpus.

According to a second aspect of the present invention there is provided a method of accessing information data sets, stored in an information system, relevant to search criteria specifying an indication of a category of information to be accessed and an indication of a predetermined level of expertise in respect of said category of information, the method comprising the steps of:

(i) selecting a training set of information data sets comprising, for each of a predetermined plurality of levels of expertise, a representative sample set of information data sets;

(ii) determining, for each data set in the training set, the value of a metric indicative of the incidence, in a reference corpus of information, of terms comprised in the training data set;

(iii) using the values of said metric determined in step (ii) to train an information classifier to identify at least one of said predetermined plurality of levels of expertise applicable to a given information data set;

(iv) applying an information searching algorithm to identify information data sets stored in said information system relevant to said specified category of information; and

(v) using the classifier trained at step (iii) to determine respective levels of expertise for information data sets identified at step (iv) and comparing the determined levels of expertise with the level of expertise specified in said search criteria to thereby select relevant information data sets.

When searching for documents relevant to a particular category of information, by taking account also of the level of expertise of a user initiating the search in that information category and matching the user's level of expertise with that determined as being necessary for documents identified in the search, the search results selected for presentation to that user are likely to be more useful than those in a similar arrangement that otherwise ignores the intended level of expertise of readers of identified documents.

According to a third aspect of the present invention there is provided an apparatus for determining a level of expertise applicable to an information data set, the level of expertise being selected from a predetermined plurality of levels of expertise, the apparatus comprising:

an input for receiving an information data set;

calculating means arranged with access to a reference corpus of information to calculate, for an information data set, the value of a metric indicative of the incidence, in the reference corpus, of terms comprised in the information data set;

a trainable classifier; and

training means for training said classifier to identify, using a training set of information data sets comprising, for each of said predetermined plurality of levels of expertise, a representative sample set of information data sets and respective values of said metric, an applicable level of expertise selected from said predetermined plurality of levels of expertise for a received information data set;

wherein, in operation, on receipt of an information data set at said input, said calculating means are arranged to calculate a respective value for said metric and to input the calculated value to said trainable classifier, trained by said training means, to determine and output an indication of at least one of said predetermined plurality of levels of expertise applicable to said received information data set.

According to a fourth aspect of the present invention there is provided an information retrieval apparatus for accessing information data sets, stored in an information system, relevant to received search criteria specifying an indication of a category of information to be accessed and an indication of a predetermined level of expertise in respect of said category of information, the apparatus comprising:

calculating means arranged with access to a reference corpus of information to calculate, for an information data set, the value of a metric indicative of the incidence, in the reference corpus, of terms comprised in the information data set;

a trainable classifier;

training means for training said classifier to identify, using a training set of information data sets comprising, for each of a predetermined plurality of levels of expertise, a representative sample set of information data sets and respective values of said metric, an applicable level of expertise selected from said predetermined plurality of levels of expertise for a given information data set;

searching means for identifying information data sets in said information system relevant to said specified category of information to be accessed; and

selecting means arranged to trigger said calculating means to calculate values of said metric for information data sets identified by said searching means, to input the values so calculated to said trainable classifier, trained by said training means, to determine and output respective applicable levels of expertise selected from said predetermined plurality of levels of expertise, and to select, for access, information data sets from those identified by said searching means having respectively determined levels of expertise that match said specified level of expertise.

According to a fifth aspect of the present invention there is provided an information retrieval apparatus for accessing information data sets, stored in an information system, relevant to received search criteria specifying an indication of a category of information to be accessed and to a specified indication of a predetermined level of expertise in respect of said category of information, the apparatus comprising:

calculating means arranged with access to a reference corpus of information to calculate, for an information data set, the value of a metric indicative of the incidence, in the reference corpus, of terms comprised in the information data set;

an information classifier, trained, using, for each of a plurality of predetermined levels of expertise, a representative sample set of training information data sets and respective values of said metric, to determine a level of expertise, selected from said plurality of predetermined levels of expertise, applicable to an information data set;

searching means for identifying information data sets in said information system relevant to said specified category of information to be accessed; and

selecting means arranged to trigger said calculating means to calculate values of said metric for information data sets identified by said searching means, to input the values so calculated to said information classifier to determine and output respective applicable levels of expertise selected from said plurality of predetermined levels of expertise, and to select, for access, information data sets from those identified by said searching means having respectively determined levels of expertise that match said specified level of expertise.

An apparatus according to the fifth aspect of the present invention may be supplied with a ready-trained information classifier rather than one that has yet to be trained. An information classifier already trained using a general cross-section of training information data sets has been found to provide an acceptable level of performance when used to access information data sets across a range of information categories.

Preferred embodiments of the present invention will now be described, by way of example only, with reference to the accompanying drawings of which:

FIG. 1 is a diagram showing a trainable document classifier usable in an apparatus according to a first embodiment of the present invention;

FIG. 2 is a diagram showing typical distributions of a preferred metric for a training sample of documents;

FIG. 3 is a flow diagram showing steps in a preferred training process;

FIG. 4 is a flow diagram showing preferred steps in operation of the apparatus of FIG. 1; and

FIG. 5 is a diagram showing an information retrieval apparatus according to a second embodiment of the present invention.

This invention arises from the observation by the inventors in the present case that a metric comprising a statistical measure of the “commonality” of terms occurring in a document with reference to a corpus of information representative of the use of words in a particular language can be used to train a conventional document classifier to distinguish those documents intended for general readership from those directed to a more expert reader. In the English language in particular, this metric may be calculated preferably with reference to the British National Corpus, a 100,000,000 word electronic databank sampled from the whole range of present-day English, spoken and written. Word frequencies for the British National Corpus have been published for example in “Word Frequencies in Written and Spoken English: based on the British National Corpus.” by Geoffrey Leech, Paul Rayson and Andrew Wilson, published (2001) by Longman, London, ISBN 0582-32007-0 (Paperback).

A first embodiment of the present invention will now be described with reference to FIG. 1.

Referring to the diagram of FIG. 1, a trained document classifier 100 is shown that has been trained, by a process to be described below, to determine and to output a rating corresponding to one of a number of predefined levels of expertise to be associated with a given document 105, or to determine and to output a probability that the given document 105 relates to one or more of those predefined levels of expertise. A metric calculator 110 is arranged with access to a reference corpus 115 of information in a particular language to enable it to calculate, for the given document 105, the value of a metric, to be defined below, indicative of the “commonality” of terms occurring in the document 105. The classifier 100 has been trained to use a value of the metric calculated by the metric calculator 110 to determine the appropriate level of expertise to associate with the document 105. The expertise rating output by the trained classifier 100 may be used in a number of different applications, in particular in an improved information retrieval arrangement where only those documents that match a user's measure of expertise in a particular field of information are selected from a set of search results for presentation to the user.

A preferred metric found to be suitable for use with a document classifier 100 to determine an expertise rating for a given document 105 is derived as follows. A value α is first calculated, by the metric calculator 110, for the given document 105 using the formula

$\alpha = \sum_{i} \log\left(tf_{i} + 1\right) \log\left(\frac{N}{n(i)}\right)$

where $tf_{i}$ is the term frequency within the given document 105 of the i-th distinct (preferably stemmed using the algorithm referenced above) term of the given document 105,

n(i) is the number of documents in the reference corpus 115 containing the i-th distinct (stemmed) term of the given document 105 and

N is the total number of documents in the reference corpus 115.

Preferably the value of n(i)/N is available directly as output from an interface to the reference corpus 115 for any particular stemmed term. For example, for a particular stemmed term, the reference corpus 115 returns a value representing the frequency with which the particular stemmed term occurs per million terms in the corpus 115.

The preferred metric then calculated by the metric calculator 110 is a “normalised” value for α, obtained by dividing α by a value β, where β is defined by:

$\beta = \frac{\text{length of the given document}}{\text{mean length of documents in the reference corpus}}$
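
By way of illustration only, the calculation of α, β and the normalised metric α/β might be sketched in Python as follows. The function names, the use of natural logarithms, the measurement of document length as a count of terms, and the skipping of terms absent from the reference corpus are all assumptions made for this sketch; none of them is specified above.

```python
import math
from collections import Counter


def alpha(terms, doc_freq, n_docs):
    """Sum, over distinct terms, of log(tf_i + 1) * log(N / n(i)).

    terms    -- list of (preferably stemmed) terms of the given document
    doc_freq -- mapping term -> n(i), the number of corpus documents containing it
    n_docs   -- N, the total number of documents in the reference corpus
    """
    total = 0.0
    for term, tf in Counter(terms).items():
        n_i = doc_freq.get(term, 0)
        if n_i == 0:
            # Assumption: a term not found in the corpus is ignored.
            continue
        total += math.log(tf + 1) * math.log(n_docs / n_i)
    return total


def beta(doc_length, mean_corpus_doc_length):
    """Length of the given document relative to the mean corpus document length."""
    return doc_length / mean_corpus_doc_length


def normalised_metric(terms, doc_freq, n_docs, mean_corpus_doc_length):
    """The preferred metric alpha / beta for one document."""
    if not terms:
        raise ValueError("document contains no terms")
    # Assumption: document length is measured as the number of terms.
    return alpha(terms, doc_freq, n_docs) / beta(len(terms), mean_corpus_doc_length)
```

In this sketch a term that occurs in every corpus document contributes nothing to α, since log(N/n(i)) is then zero, while rarer terms contribute more.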

It has been found that when the values for this preferred metric α/β are plotted for a range of documents, those documents typically directed to “expert” readers in a particular field have a substantially distinct range of values for α/β in comparison with that for documents intended for more “general” readership. The differences in the two distributions can be seen, for a particular sample of documents, in FIG. 2.

Referring to FIG. 2, two distributions are shown, one distribution 200 for a sample of documents known to be intended for “general” readership and one distribution 205 for a sample of documents known to be intended for more “expert” readership. If more than two levels of expertise are to be distinguished, then samples of documents may be selected representative of one or more intermediate levels of expertise and the corresponding distributions plotted. Distributions may also be made in respect of samples of documents distinguishing “child” from “adult” levels of “expertise”.

There are numerous variations to the formulae provided above for calculating α and β of the preferred metric, for use in preferred embodiments of the present invention, that would be apparent to a person of ordinary skill, each variation taking account of the “commonality” of terms occurring within a given document. In addition, there are numerous variations in the way in which terms of a given document may be selected for use in calculating a value for the preferred metric. For example, rather than considering every term within a given document, a known algorithm may be used to select terms most likely to be indicative of the information content of the given document, for example an algorithm to extract so-called “key terms” as described in European patent number EP 1032896 by the present Applicants. In a further variation, the reference corpus 115 used in preferred embodiments of the present invention may be selected from a range of specialised corpora according to the particular information topic of documents under consideration or, more generally, according to whether the documents under consideration relate to technical or non-technical subject matter, or to children's literature for example.

Having determined a suitable metric as defined above, the next step is to use that metric to train a document classifier either to identify which of the predefined levels of expertise to associate with a given document 105, or to determine a set of probabilities that a given document 105 is associated with one or more of the predefined levels of expertise. To this end, steps in a preferred training process will now be described with reference to the flow diagram of FIG. 3.

Referring to FIG. 3, the training process begins with, at STEP 300, selection of a training set of documents comprising, for each of the predetermined levels of expertise to be applied, a representative training sample of documents known to contain subject matter expressed in a way suitable for readers having that level of expertise, e.g. “expert” readers or those with only a “general” appreciation of a given information topic. In practice, while the training set of documents may relate to a particular information topic and a different training set of documents may be selected for each information topic, it has been found that a more general training set yields acceptable results when used to rate documents relating to a number of different information topics. At STEP 305, the value for the preferred metric α/β is calculated, for example by the metric calculator 110, for each of the documents in the training set. At STEP 310, knowing the level of expertise associated with each document of the training set and the corresponding values for α/β, a conventional document classifier is trained to associate a given document 105 with one of the predefined levels of expertise on the basis of a respective value for α/β. Preferably, the document classifier may be trained at STEP 310 by making distributions of document frequency in the respective training sample sets for values of α/β, as in FIG. 2, and on the basis of the document frequency distributions for each sample, determining the range of values of α/β corresponding to each of the predefined levels of expertise (there being two levels of expertise, “General” and “Expert”, in the example of FIG. 2). Alternatively, if required, the document classifier 100 may be arranged, after training, to output probability values in respect of each of the predefined levels of expertise yielding a non-zero probability for the given document 105.
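
One possible reading of STEP 310 is sketched below in Python, purely as an illustration. The per-level distributions are represented as simple histograms over α/β; the bin width, the function names, the treatment of each level as equally likely a priori, and the sample values used in the usage note are assumptions of this sketch rather than features of the training process described above.

```python
from collections import defaultdict


def level_ranges(samples):
    """Range of metric values observed for each level of expertise.

    samples -- mapping level name -> list of alpha/beta values for that
               level's training documents
    """
    return {level: (min(values), max(values)) for level, values in samples.items()}


def train_histograms(samples, bin_width):
    """Per-level distribution: bin index -> fraction of that level's documents."""
    histograms = {}
    for level, values in samples.items():
        counts = defaultdict(int)
        for value in values:
            counts[int(value // bin_width)] += 1
        histograms[level] = {b: c / len(values) for b, c in counts.items()}
    return histograms


def level_probabilities(histograms, value, bin_width):
    """Probabilities for each level with a non-zero probability at this value.

    Assumption: levels are weighted equally, so the probabilities are the
    per-level bin fractions renormalised to sum to one.
    """
    b = int(value // bin_width)
    weights = {level: h.get(b, 0.0) for level, h in histograms.items()}
    total = sum(weights.values())
    if total == 0:
        return {}  # value falls outside every training distribution
    return {level: w / total for level, w in weights.items() if w > 0}
```

With made-up training values such as `{"General": [1.0, 1.5, 2.2], "Expert": [2.4, 3.1, 3.6]}` and a bin width of 1.0, a document whose metric falls where the two distributions overlap receives a non-zero probability for both levels, whereas a document in a non-overlapping bin is assigned to a single level.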

Steps in a preferred process, operable by the apparatus of FIG. 1, for determining the level of expertise for a given document 105, will now be described with reference to the flow diagram of FIG. 4.

Referring to FIG. 4, the preferred process begins at STEP 400 with receipt of a document 105 to be rated. At STEP 405 the value of the preferred metric α/β is calculated by the metric calculator 110 for the received document 105 using the formulae provided above, with reference to the reference corpus 115. Preferably, when accessing the reference corpus 115 to obtain a relative frequency score for a stemmed form of a particular term, if the reference corpus 115 provides relative frequency scores for homonyms of the particular term, the metric calculator 110 is arranged to sum the relative frequencies provided for each homonym. That is, no attempt is made by the metric calculator 110 to distinguish use of a particular term in a given document 105 as a preposition from its use as an adjective, for example, before obtaining the relative frequency score from the reference corpus 115. However, the metric calculator 110 may be arranged optionally to implement a known algorithm to analyse terms in the given document 105 and to identify the particular use of each term before obtaining the respective score for that use of the term from the reference corpus 115.
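
The summing of relative frequencies over homonyms may be illustrated by the following minimal Python sketch. The representation of corpus entries as (stem, part of speech, frequency per million) tuples and the function name are hypothetical, and the figures in the usage note are invented for illustration; they are not taken from the British National Corpus.

```python
def summed_relative_frequency(stem, corpus_entries):
    """Sum the per-million frequencies of every corpus entry sharing a stem.

    corpus_entries -- iterable of (stem, part_of_speech, freq_per_million)
                      tuples; a hypothetical layout chosen for this sketch.
    No attempt is made to distinguish the different uses of the term.
    """
    return sum(freq for entry_stem, _pos, freq in corpus_entries if entry_stem == stem)
```

For example, with invented entries `("light", "noun", 120.0)` and `("light", "adjective", 80.0)`, the score obtained for the stem "light" would be 200.0 per million.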

The resultant value for α/β is input, at STEP 410, to the trained document classifier 100, preferably trained according to the process of FIG. 3, and at STEP 415 the trained document classifier 100 outputs either an indication of the level of expertise to associate with the received document 105 or a set of probabilities that the received document 105 is associated with each of one or more of the levels of expertise. This latter output is of particular use in fuzzy processing systems.

A preferred information retrieval apparatus will now be described with reference to FIG. 5, incorporating the trained document classifier 100 of FIG. 1 in a preferred embodiment of the present invention.

Referring to FIG. 5, an information retrieval software agent 500 is arranged to operate on behalf of a user to identify documents relevant to the user's submitted search criteria 505. Search criteria 505 typically comprise a set of keywords/phrases relating to a particular category of information sought by the user. The information retrieval software agent 500 is arranged with access to a user profile store 510 wherein a predefined user profile may be stored for the user, the profile containing an indication of the level of expertise of the user in respect of the particular category of information being sought. However, the level of expertise of the user submitting the search criteria 505 may optionally be specified within the search criteria 505, so obviating the need for the information retrieval software agent 500 to make a separate access to the user profile store 510 to obtain the user's expertise level.

The information retrieval software agent 500 is arranged with access to the Internet 515 and hence to one or more search engines 520 to help identify and retrieve sets of information stored on web servers 525 relevant to the user's submitted search criteria 505. The information retrieval software agent 500 is also arranged with access to a trained document classifier 100 as above, by way of a metric calculator 110 arranged with access to a reference corpus 115 for calculating a value for the metric α/β, as defined above, for a particular document, which value when input to the trained document classifier 100 enables the level of expertise associated with the particular document to be determined. The information retrieval software agent 500 is arranged to output a list of search results 530 in response to the user's submitted search criteria 505, the search results 530 being tailored both to the user's specified category of information (505) and to the user's level of expertise (510) with respect to that category of information (505).

In operation, the information retrieval software agent 500 is arranged, on receipt of search criteria 505 submitted by a user, to access the user's personal profile 510 to determine the level of expertise of the user in respect of the category of information represented by the submitted criteria 505, assuming that the user has not specified his/her level of expertise within the search criteria 505. The information retrieval software agent 500 then accesses search engines 520 or web servers 525 directly to identify and retrieve sets of information relevant to the information category specified in the submitted search criteria 505, by conventional means. As relevant information sets are identified and received, the information retrieval software agent 500 determines the level of expertise to be associated with each relevant information set using functionality provided by the metric calculator 110 and the trained document classifier 100, as described above with reference to FIG. 4. The information retrieval software agent 500 compares the level of expertise determined for each relevant information set with the level of expertise (510) of the user and thereby selects, to output to the user as search results 530, a set of relevant information sets having determined levels of expertise matching the user's level of expertise.
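
The selection performed by the agent 500 can be sketched, under stated assumptions, as follows. The dictionary layout of the search criteria and of the user profile store, the function names, and the `rate_document` callable (standing in for the metric calculator 110 followed by the trained classifier 100) are hypothetical and introduced only for this illustration; the search itself is assumed to have been carried out by conventional means.

```python
def resolve_user_level(criteria, profile_store, user_id):
    """Use a level given in the search criteria, else fall back to the profile.

    criteria      -- dict with a "category" key and an optional "level" key
    profile_store -- dict: user id -> {category: level of expertise}
    """
    if criteria.get("level") is not None:
        return criteria["level"]
    return profile_store[user_id][criteria["category"]]


def select_results(candidates, rate_document, user_level):
    """Keep only those candidates whose determined level matches the user's.

    candidates    -- documents already found relevant to the category
    rate_document -- callable returning the level of expertise of a document
    """
    return [doc for doc in candidates if rate_document(doc) == user_level]
```

Exact matching of levels is used here because the description above refers to matching levels of expertise; an arrangement using the probability output of the classifier could instead apply a threshold, but that variation is not shown.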

In a further embodiment of the present invention a trained document classifier 100 may be used to derive a measure of the level of expertise of a user in respect of a particular information topic. By monitoring information retrieval activity of a user in respect of the particular information topic, those documents that the user evidently finds useful, for example because the user retrieves a whole document to read or provides feedback as to the usefulness of the document, may be input to the metric calculator 110 and the respective metric values input to the trained document classifier 100 to determine the level of expertise to associate with these “useful” documents and hence, by implication, the level of expertise of the user in the information topic that those documents represent.
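
A minimal Python sketch of this further embodiment is given below. The description above does not state how the levels determined for several “useful” documents are to be combined; taking the most frequently occurring level is an assumption of this sketch, as are the function name and the `rate_document` callable standing in for the metric calculator 110 and the trained classifier 100.

```python
from collections import Counter


def infer_user_level(useful_documents, rate_document):
    """Estimate a user's level of expertise from documents found useful.

    Assumption: the level determined most often among the useful documents
    is taken as the user's level. Returns None if no documents are supplied.
    """
    levels = [rate_document(doc) for doc in useful_documents]
    if not levels:
        return None
    return Counter(levels).most_common(1)[0][0]
```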

It would be apparent to a person of ordinary skill in this field of information retrieval that preferred embodiments of the present invention may be applied in other information retrieval arrangements in which the expertise of a user may be taken into account when selecting information for presentation to that user or otherwise used in respect of that user.

1. A method for determining a measure of the level of expertise applicable to a given information data set, comprising the steps of: (i) selecting, in respect of each of a plurality of predetermined levels of expertise, a representative sample set of information data sets; (ii) determining, for each of said selected information data sets, the value of a metric indicative of the incidence, in a reference corpus of information, of terms comprised in the selected information data set; and (iii) using the values of said metric determined in step (ii) to train an information classifier to identify, from a value of said metric calculated for the given information data set, at least one of said plurality of predetermined levels of expertise applicable to the given information data set.
2. A method as in claim 1, wherein said metric comprises a combined measure of the incidence within an information data set of terms comprised in the information data set and of the incidence of each said term in the reference corpus.
3. A method as in claim 1, wherein at step (iii), training the classifier comprises: (a) making distributions of normalised values of said metric for data sets in each of the representative sample sets selected at step (i); and (b) for each of said predetermined levels of expertise, identifying from said distributions a corresponding range of normalised values of said metric.
4. A method as in claim 1, wherein at step (iii), the trained classifier is arranged to determine a measure of the probability that a particular one of said predetermined levels of expertise is applicable to the information data set.
5. A method as in claim 1, wherein determining a value for said metric comprises applying a stemming algorithm to stem terms comprised in a respective information data set and determining the incidence of the stemmed terms in the reference corpus.
6. A method as in claim 1, wherein the reference corpus is provided with an interface for outputting the relative frequency of occurrence in the corpus of a term.
7. A method of accessing information data sets, stored in an information system, relevant to search criteria specifying an indication of a category of information to be accessed and to a specified indication of a predetermined level of expertise in respect of said category of information, the method comprising the steps of: (i) selecting a training set of information data sets comprising, for each of a plurality of predetermined levels of expertise, a representative sample set of information data sets; (ii) determining, for each data set in the training set, the value of a metric indicative of the incidence, in a reference corpus of information, of terms comprised in the training data set; (iii) using the values of said metric determined in step (ii) to train an information classifier to identify at least one of said plurality of predetermined levels of expertise applicable to a given information data set; (iv) applying an information searching algorithm to identify information data sets stored in said information system relevant to said specified category of information; and (v) using the classifier trained at step (iii) to determine respective levels of expertise for information data sets identified at step (iv) and comparing the determined levels of expertise with the specified level of expertise to thereby select relevant information data sets.
 8. An apparatus for determining a level of expertise applicable to an information data set, the level of expertise being selected from a plurality of predetermined levels of expertise, the apparatus comprising: an input for receiving an information data set; calculating means arranged with access to a reference corpus of information to calculate, for an information data set, the value of a metric indicative of the incidence, in the reference corpus, of terms comprised in the information data set; a trainable classifier; and training means for training said classifier to identify, using a training set of information data sets comprising, for each of said plurality of predetermined levels of expertise, a representative sample set of information data sets and respective values of said metric, an applicable level of expertise selected from said plurality of predetermined levels of expertise for a received information data set; wherein, in operation, on receipt of an information data set at said input, said calculating means are arranged to calculate a respective value for said metric and to input the calculated value to said trainable classifier, trained by said training means, to determine and output an indication of at least one of said plurality of predetermined levels of expertise applicable to said received information data set.
 9. An apparatus as in claim 8, wherein said metric comprises a combined measure of the incidence within an information data set of terms comprised in the information data set and of the incidence of each said term in the reference corpus.
 10. An apparatus as in claim 8, wherein said training means are arranged to train said trainable classifier using the steps of: (a) making distributions of normalised values of said metric for data sets in each of the representative sample sets; and (b) for each of said predetermined levels of expertise, identifying from said distributions a corresponding range of normalised values of said metric.
 11. An apparatus as in claim 8, wherein said trainable classifier is arranged, after training by said training means, to determine a measure of the probability that a particular one of said plurality of predetermined levels of expertise is applicable to a received information data set.
 12. An apparatus as in claim 8, wherein said calculating means are arranged to calculate a value for said metric by applying a stemming algorithm to stem terms of a respective information data set and by determining the relative incidence of the stemmed terms in the reference corpus.
 13. An information retrieval apparatus for accessing information data sets, stored in an information system, relevant to received search criteria specifying an indication of a category of information to be accessed and to a specified indication of a predetermined level of expertise in respect of said category of information, the apparatus comprising: calculating means arranged with access to a reference corpus of information to calculate, for an information data set, the value of a metric indicative of the incidence, in the reference corpus, of terms comprised in the information data set; a trainable classifier; training means for training said classifier to identify, using a training set of information data sets comprising, for each of a plurality of predetermined levels of expertise, a representative sample set of information data sets and respective values of said metric, an applicable level of expertise selected from said plurality of predetermined levels of expertise for a given information data set; searching means for identifying information data sets in said information system relevant to said specified category of information to be accessed; and selecting means arranged to trigger said calculating means to calculate values of said metric for information data sets identified by said searching means, to input the values so calculated to said trainable classifier, trained by said training means, to determine and output respective applicable levels of expertise selected from said plurality of predetermined levels of expertise, and to select, for access, information data sets from those identified by said searching means having respectively determined levels of expertise that match said specified level of expertise.
 14. An information retrieval apparatus for accessing information data sets, stored in an information system, relevant to received search criteria specifying an indication of a category of information to be accessed and to a specified indication of a predetermined level of expertise in respect of said category of information, the apparatus comprising: calculating means arranged with access to a reference corpus of information to calculate, for an information data set, the value of a metric indicative of the incidence, in the reference corpus, of terms comprised in the information data set; an information classifier, trained, using, for each of a plurality of predetermined levels of expertise, a representative sample set of training information data sets and respective values of said metric, to determine a level of expertise, selected from said plurality of predetermined levels of expertise, applicable to an information data set; searching means for identifying information data sets in said information system relevant to said specified category of information to be accessed; and selecting means arranged to trigger said calculating means to calculate values of said metric for information data sets identified by said searching means, to input the values so calculated to said information classifier to determine and output respective applicable levels of expertise selected from said plurality of predetermined levels of expertise, and to select, for access, information data sets from those identified by said searching means having respectively determined levels of expertise that match said specified level of expertise.
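
By way of illustration only, and without limiting the claims, the following Python sketch shows one possible reading of the claimed steps: stemming terms and looking up their incidence in a reference corpus, training a range-based classifier on normalised metric values, and selecting retrieved data sets by matching level of expertise. Everything named here is an assumption made for the example and is not taken from the specification: the `REFERENCE_FREQ` table and `UNSEEN_FREQ` floor stand in for a real reference corpus interface, `stem` is a crude suffix stripper standing in for a real stemming algorithm, the metric is a mean negative log frequency, the ranges are simple minimum and maximum bounds, and the searching step is reduced to a category tag comparison.

```python
import math
import re
from collections import Counter

# Hypothetical reference corpus interface: relative frequency of each stemmed term.
REFERENCE_FREQ = {"the": 0.06, "cell": 0.0004, "divid": 0.0002, "mitosi": 0.000001}
UNSEEN_FREQ = 1e-7  # assumed floor for terms absent from the reference corpus


def stem(term):
    """Crude suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def rarity_metric(text):
    """Mean negative log10 reference frequency of the stemmed terms in text.

    Weighting by in-document counts combines incidence within the data set
    with incidence in the reference corpus (an assumed form of the metric).
    """
    counts = Counter(stem(t) for t in re.findall(r"[a-z]+", text.lower()))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    weighted = sum(
        n * -math.log10(REFERENCE_FREQ.get(term, UNSEEN_FREQ))
        for term, n in counts.items()
    )
    return weighted / total


def train_ranges(training_set):
    """training_set maps each level to a non-empty list of sample texts.

    Returns (scale, ranges), where ranges maps each level to the (low, high)
    bounds of its normalised metric values.
    """
    values = {
        level: [rarity_metric(text) for text in texts]
        for level, texts in training_set.items()
    }
    scale = max(v for vs in values.values() for v in vs) or 1.0
    ranges = {
        level: (min(vs) / scale, max(vs) / scale) for level, vs in values.items()
    }
    return scale, ranges


def classify(text, scale, ranges):
    """Return every level whose trained range contains the normalised metric."""
    x = rarity_metric(text) / scale
    return [level for level, (low, high) in ranges.items() if low <= x <= high]


def retrieve(documents, category, level, scale, ranges):
    """documents is a list of (category, text) pairs.

    The category comparison is a stand-in for a real searching algorithm.
    """
    return [
        text
        for doc_category, text in documents
        if doc_category == category and level in classify(text, scale, ranges)
    ]
```

In this sketch a data set whose normalised metric falls outside every trained range is assigned no level, and one falling inside overlapping ranges is assigned several; a probability-based classifier of the kind recited in claims 4 and 11 would be one way to resolve such cases.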